[pull] master from DataDog:master #533
Merged
* fix teamcity tests failure
* pin mockserver version to 5.15
* Add GPU cost overview dashboards

  Adds 5 dashboards to the GPU integration for monitoring GPU compute spend and utilization across cloud providers and Kubernetes.

  Dashboards:
  - gpu_cost_overview: cross-cloud totals, spend by team/env, fleet utilization
  - aws_gpu_cost_overview: AWS-specific with Capacity Block tracking
  - azure_gpu_cost_overview: Azure GPU VM families (NC, NCv3, ND series)
  - gcp_gpu_cost_overview: GCP GPU SKUs with On-Demand vs Committed coverage
  - k8s_gpu_cost_overview: cluster/namespace allocation, idle cost attribution

  Cost queries span cloud_cost amortized metrics plus unblended for AWS Capacity Blocks (which are not captured in amortized). Utilization widgets join GPU telemetry (gpu.sm_active, gpu.device.total) with cost data for unit-economics views.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Refine GPU utilization metrics and dashboard polish

  - Switch unhealthy KPIs to GPU Idle % (gr_engine_active) since gpu.device.unhealthy is non-functional
  - Add Healthy GPU Rate KPI on k8s using kubernetes_state.node.gpu_allocatable / gpu_capacity
  - Add Spend on Idle GPUs KPI to per-cloud and overview dashboards (excludes AWS Capacity Blocks since their cost is upfront and unrelated to engine activity)
  - Standardize utilization terminology: "Average GPU Utilization %" and "GPU Idle %" across dashboards
  - Drop redundant cloud provider prefix from per-cloud widget titles and remove the "GPU Spend" group wrapper to match k8s dashboard layout

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add GPU Monitoring setup CTA banner to per-cloud and k8s dashboards

  Mirrors the existing banner from the overview dashboard so users on any cloud-specific or Kubernetes view see the link to enable GPU Monitoring, which populates utilization-driven widgets.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
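The two KPIs named in the commit above can be illustrated with a small Python sketch. The commit does not include the dashboard queries, so everything here is an assumption made for illustration: the function names, the `GpuCostRow` type, expressing the Healthy GPU Rate as a percentage, and treating idle spend as cost multiplied by idle fraction. Only the allocatable/capacity ratio and the exclusion of AWS Capacity Blocks come from the commit text.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class GpuCostRow:
    """Hypothetical row joining cost data with GPU telemetry."""
    cost: float               # spend for the period, in any currency unit
    engine_active: float      # fraction of time the GPU engine was active (0.0 to 1.0)
    is_capacity_block: bool   # True for AWS Capacity Block spend


def healthy_gpu_rate(allocatable: float, capacity: float) -> Optional[float]:
    """Allocatable GPUs as a percentage of GPU capacity (assumed presentation)."""
    if capacity <= 0:
        return None  # no GPUs reported, so the ratio is undefined
    return 100.0 * allocatable / capacity


def spend_on_idle_gpus(rows: Iterable[GpuCostRow]) -> float:
    """Assumed formula: cost weighted by idle fraction, skipping Capacity Blocks."""
    total = 0.0
    for row in rows:
        if row.is_capacity_block:
            continue  # upfront cost, unrelated to engine activity
        total += row.cost * (1.0 - row.engine_active)
    return total
```

For example, a node reporting 6 allocatable GPUs out of a capacity of 8 would show a rate of 75.0 under this reading, and a Capacity Block row contributes nothing to the idle-spend total regardless of its activity.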
…23609)

* fix(nutanix): normalize all ntnx_* tags to lowercase with $unknown fallback

  Route every enum-backed ntnx_* tag (host_type, hypervisor_type, node_status, plus the previously-handled state tags) through a single _norm_state helper so they all follow one rule: lowercase the API value, fall back to "\$unknown" when the source is missing.

  Picks "\$unknown" (the API spec's own sentinel) as the fallback so there's no mismatch between "value present but says \$UNKNOWN" and "value missing"; both surface as ntnx_X:\$unknown. ntnx_disk_status's "unknown" fallback is updated to "\$unknown" for the same reason.

* docs(nutanix): add changelog for tag normalization

* docs(nutanix): shorten changelog to one customer-facing line

* refactor(nutanix): extract tag values into named variables

  Hoist _norm_state and get_nested calls out of f-strings in the tag-extraction helpers. Each tag computation now binds to a named local first, making the read top-down and easier to step through.

* refactor(nutanix): rename _norm_state to _normalize_tag_value

  The helper is used for type, state, mode, and status tags, not just state, so the broader name better describes what it does.

* refactor(nutanix): collapse node_status to one variable, restore tags = []

  Lowercase the node-status comparison sets so the normalized tag value serves both the status_value lookup and the tag emission, removing the need for a separate node_status_tag local. Restore the tags = [] preamble in the tag-extraction helpers since it makes the building intent obvious.

* fix(nutanix): normalize powerState in vm.status gauge

  Match what _report_host_status_metrics does: route the powerState lookup through _normalize_tag_value and lowercase the comparison literals. Removes the asymmetry where vm.status was the only metric still relying on raw uppercase API values. Addresses review feedback on PR #23609.
* refactor(nutanix): normalize disk statuses in _aggregate_disk_status

  Lowercase DEGRADED_DISK_STATUSES at the constant and route disk ``status`` values through ``_normalize_tag_value`` so the comparison surface matches the rest of the module. Aligns with the convention introduced in ``_report_host_status_metrics``.

* test(nutanix): cover \$unknown fallback for host enum tags

  Verify ``ntnx_host_type``, ``ntnx_hypervisor_type``, and ``ntnx_node_status`` emit ``\$unknown`` when ``hostType``, ``hypervisor.type``, and ``nodeStatus`` are absent from the host payload, the always-emit behavior introduced in this PR.

* test(nutanix): align unknown-fallback test with repo conventions
delete 'pause_auto_refresh'
See Commits and Changes for more details.
Created by pull[bot] (v2.0.0-alpha.4)
Can you help keep this open source service alive? 💖 Please sponsor : )